Databricks Custom Transformation Job

Data transformation is the process of converting, cleansing, and structuring raw data into a usable format that can be analyzed to support decision-making. Typical steps include removing duplicates, converting data types, and enriching the dataset. The resulting dataset can then be used for data analysis or as input for AI/ML processes.
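
As a minimal illustration of the three steps named above (removing duplicates, converting data types, and enriching the dataset), the following sketch uses plain Python on a small in-memory list of records. The field names (id, amount, country) and the region lookup are hypothetical and are not part of Lazsa DPS. In a Databricks notebook, the same steps would typically be written against a Spark DataFrame.

```python
from typing import Dict, List

# Hypothetical raw records: every value arrives as a string, and one row is duplicated.
RAW_ROWS = [
    {"id": "1", "amount": "10.50", "country": "US"},
    {"id": "2", "amount": "7.25", "country": "DE"},
    {"id": "1", "amount": "10.50", "country": "US"},
]

# Hypothetical lookup table used to enrich each record with a region.
REGION_BY_COUNTRY = {"US": "Americas", "DE": "EMEA"}


def transform(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Deduplicate by id, convert types, and add a region column."""
    seen = set()
    result = []
    for row in rows:
        key = row["id"]
        if key in seen:
            continue  # remove duplicates: keep only the first row per id
        seen.add(key)
        result.append({
            "id": int(row["id"]),            # convert data types
            "amount": float(row["amount"]),
            "country": row["country"],
            # enrich the dataset with a derived attribute
            "region": REGION_BY_COUNTRY.get(row["country"], "Unknown"),
        })
    return result


cleaned = transform(RAW_ROWS)
```

The cleaned list keeps one row per id, with numeric columns and an added region value for each record.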

Lazsa Data Pipeline Studio (DPS) provides templates for creating transformation jobs. These jobs include join, union, and aggregate functions that group or combine data for analysis.
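
To show what join, union, and aggregate operations do to data, here is a small plain-Python sketch. It does not reproduce the DPS templates, which are configured in the product UI. The datasets and column names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input datasets.
orders_2023 = [{"customer_id": 1, "amount": 20.0}, {"customer_id": 2, "amount": 5.0}]
orders_2024 = [{"customer_id": 1, "amount": 15.0}]
customers = {1: "Asha", 2: "Ben"}  # customer_id -> name

# Union: stack rows from two datasets that share the same columns.
all_orders = orders_2023 + orders_2024

# Join (inner): attach the customer name to each order with a matching customer_id.
joined = [
    {**order, "name": customers[order["customer_id"]]}
    for order in all_orders
    if order["customer_id"] in customers
]

# Aggregate: total the order amount per customer name.
totals_by_name = defaultdict(float)
for row in joined:
    totals_by_name[row["name"]] += row["amount"]
totals = dict(totals_by_name)
```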

For more complex operations on data, Lazsa DPS lets you create custom transformation jobs. A custom transformation job is created from a template that contains placeholder code. You can navigate to Databricks Notebook and replace the placeholder code with your own custom code.

To create a Databricks custom transformation job

  1. Log on to the Lazsa Platform and navigate to Products.

  2. Select a product and feature. Click the Develop stage of the feature and navigate to Data Pipeline Studio.

  3. Create a pipeline with the following nodes:

    Note: The stages and technologies used in this pipeline are examples only.

    • Data Lake - Amazon S3

    • Data Transformation - Databricks

    Databricks Transformation job

  4. Configure the data lake and data transformation nodes.

  5. In the data transformation stage, click the Databricks node and select Create Custom Job to create a custom transformation job.

  6. Complete the following steps to create the Databricks custom transformation job:

To replace the placeholder custom code

After you create the custom transformation job, click the Databricks Notebook icon. This opens the custom transformation job in the Databricks UI. Replace the placeholder code with your custom code, and then run the job.

Navigate to Databricks Notebook

Note:

The sample code provided in Databricks Notebook includes job run parameters such as Records Processed and Time Taken. If you delete these parameters, the time taken for the job run that is displayed in View Details on the UI may not be accurate. This is because the displayed time then includes the time required to start the Databricks cluster, if the cluster is not already running.
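
The reason for measuring these values inside the notebook code can be sketched as follows. This is an illustrative example only: the function names and the metrics dictionary are hypothetical and do not represent the actual sample code or the way the Lazsa Platform reads the parameters. Because the timer starts only when the transformation itself begins, the measured time excludes any cluster startup time.

```python
import time


def run_transformation(rows):
    """Hypothetical stand-in for your custom transformation logic."""
    return [row for row in rows if row is not None]


def run_with_metrics(rows):
    """Run the transformation and record how many rows it produced and how long it took."""
    start = time.perf_counter()  # timer starts after the cluster is already up
    output = run_transformation(rows)
    elapsed_seconds = time.perf_counter() - start
    metrics = {
        "records_processed": len(output),       # assumed counterpart of Records Processed
        "time_taken_seconds": elapsed_seconds,  # assumed counterpart of Time Taken
    }
    return output, metrics


output, metrics = run_with_metrics([1, None, 2, 3])
```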

What's next? Snowflake Custom Transformation Job